In my previous blog post, we got to know the idea of "indentation-based complexity". We took a static view on the Linux kernel to spot the most complex areas.
This time, we want to track the evolution of the indentation-based complexity of a software system over time. We are especially interested in its correlation with the lines of code: if the lines of code of our system stay more or less stable while the amount of indentation per source code file keeps increasing, we very likely have a complexity problem.
Again, this analysis is highly inspired by Adam Tornhill's book "Software Design X-Rays", which I currently always recommend if you want to take a deep dive into software data analysis.
To calculate the evolution of our software system, we can use data from the version control system. In our case, we can get all changes to Java source code files with Git. We just need to say the right magic words, which are
git log -p -- *.java
This gives us data like the following:
commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>
Date: Fri Aug 19 18:54:56 2016 +0200
Convert Controler's integration test to unit test
diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
index ee83b8a..a83255b 100644
--- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
+++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
@@ -1,8 +1,5 @@
package org.springframework.samples.petclinic.web;
-import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
-import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
-
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
We have the commit metadata together with the diff header of the changed file
commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>
Date: Fri Aug 19 18:54:56 2016 +0200
Convert Controler's integration test to unit test
diff --git a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
index ee83b8a..a83255b 100644
--- a/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
+++ b/src/test/java/org/springframework/samples/petclinic/web/CrashControllerTests.java
and the full file diff, where we can see additions or modifications (+) and deletions (-):
package org.springframework.samples.petclinic.web;
-import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
-import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.*;
-
import org.junit.Before;
We "just" have to get this data into our favorite data analysis framework, which is, of course, Pandas :-). We can actually do that! Let's see how!
In [19]:
import pandas as pd
diff_raw = pd.read_csv(
"../../buschmais-spring-petclinic_fork/git_diff.log",
sep="\n",
names=["raw"])
diff_raw.head(5)
Out[19]:
In [20]:
diff_raw[diff_raw.raw.str.startswith("commit")].head()
Out[20]:
The output is the commit data that I've described above, where each line in the text file represents one row in the DataFrame (without blank lines).
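The "one line, one row" idea can also be sketched without read_csv, which may help because some newer pandas versions reject "\n" as a separator. This is only a minimal sketch: the helper read_lines and the sample text are made up for illustration and stand in for the real git_diff.log.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the contents of git_diff.log.
sample_log = """commit e5254156eca3a8461fa758f17dc5fae27e738ab5
Author: Antoine Rey <antoine.rey@gmail.com>

    Convert Controler's integration test to unit test

-import org.junit.Before;
"""

def read_lines(handle):
    """Return a DataFrame with one 'raw' row per non-blank line."""
    lines = [line.rstrip("\n") for line in handle if line.strip()]
    return pd.DataFrame({"raw": lines})

sample_raw = read_lines(io.StringIO(sample_log))
print(sample_raw)
```

For the real file, the same helper could be called with an opened file handle instead of the StringIO object.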
We skip all the data that we definitely don't need. In particular, the two lines of the "extended index header" that begin with +++ and --- could easily be mixed up with the real diff data, which also begins with a + or a -. Fortunately, we can identify these rows easily: they are the two rows that follow the row that starts with index. Using the shift operation starting at the row with index, we can get rid of all those lines.
In [21]:
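index_row = diff_raw.raw.str.startswith("index ")
# mark the two rows after each "index" row (the --- and +++ header lines)
ignored_diff_rows = (index_row.shift(1, fill_value=False) | index_row.shift(2, fill_value=False))
diff_raw = diff_raw[~(index_row | ignored_diff_rows)]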
diff_raw.head(10)
Out[21]:
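To make the shift trick easier to follow, here is a tiny, self-contained sketch on six made-up rows (the file name Foo.java is hypothetical). It shows how shifting the boolean marker by one and two rows flags the two header lines after the index row.

```python
import pandas as pd

# Made-up excerpt of a diff; only the structure matters here.
rows = pd.Series([
    "diff --git a/Foo.java b/Foo.java",
    "index ee83b8a..a83255b 100644",
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "-import org.junit.Before;",
    "+import org.junit.Test;",
])

is_index = rows.str.startswith("index ")
# shift(n) moves the marker n rows down; fill_value avoids NaN at the top
follows_index = (is_index.shift(1, fill_value=False)
                 | is_index.shift(2, fill_value=False))
kept = rows[~(is_index | follows_index)]
print(kept.tolist())
```

Only the diff --git row and the two real change lines survive the filter.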
Next, we extract some metadata of each commit. We can identify the different entries by using a regular expression that looks for a specific keyword at the beginning of each line. We extract each piece of information into a new Series/column because we need it for every changed line in the software's history.
In [22]:
diff_raw['commit'] = diff_raw.raw.str.split("^commit ").str[1]
diff_raw['timestamp'] = pd.to_datetime(diff_raw.raw.str.split("^Date: ").str[1])
diff_raw['path'] = diff_raw.raw.str.extract("^diff --git.* b/(.*)", expand=True)[0]
diff_raw.head()
Out[22]:
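As a small illustration of the keyword-based extraction, the following sketch uses str.extract with one capture group per keyword on four made-up rows. The column names and the shortened path src/A.java are assumptions for this example; rows that don't match a pattern simply get NaN.

```python
import pandas as pd

# Made-up rows in the shape of the git log output.
meta = pd.DataFrame({"raw": [
    "commit e5254156eca3a8461fa758f17dc5fae27e738ab5",
    "Date:   Fri Aug 19 18:54:56 2016 +0200",
    "diff --git a/src/A.java b/src/A.java",
    "+import org.junit.Test;",
]})

# One capture group per keyword; expand=False returns a Series.
meta["commit"] = meta.raw.str.extract(r"^commit (\w+)", expand=False)
meta["date"] = meta.raw.str.extract(r"^Date:\s+(.*)", expand=False)
meta["path"] = meta.raw.str.extract(r"^diff --git.* b/(.*)", expand=False)
print(meta[["commit", "date", "path"]])
```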
To assign each commit's metadata to the remaining rows, we forward fill those rows with the metadata by using the ffill method.
In [23]:
diff_raw = diff_raw.ffill()
diff_raw.head(8)
Out[23]:
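Forward filling is easiest to see on a toy example. In this sketch (with invented commit ids abc and def), the commit id is only known on the commit rows and is carried down to every change line that follows.

```python
import pandas as pd

# Invented data: the commit id is only present on the "commit" rows.
log = pd.DataFrame({
    "raw": ["commit abc", "+line one", "-line two", "commit def", "+line three"],
    "commit": ["abc", None, None, "def", None],
})

# Carry the last known commit id down to the rows below it.
log["commit"] = log["commit"].ffill()
print(log)
```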
In [24]:
diff_raw["i"] = diff_raw.raw.str[1:].str.len() - diff_raw.raw.str[1:].str.lstrip().str.len()
diff_raw.head()
Out[24]:
In [25]:
# number of leading spaces of added (+) and deleted (-) lines; NaN for all other rows
diff_raw['added'] = diff_raw.raw.str.extract(r"^\+( *).*$", expand=True)[0].str.len()
diff_raw['deleted'] = diff_raw.raw.str.extract(r"^-( *).*$", expand=True)[0].str.len()
diff_raw.head()
For our later indentation-based complexity calculation, we have to make sure that each line is indented with spaces only, so we replace the tab characters.
In [26]:
diff_raw['line'] = diff_raw.raw.str.replace("\t", " ")
diff_raw.head()
Out[26]:
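Here is a minimal sketch of why the tab replacement matters for counting indentation. The tab width of four spaces is an assumption of this example, and the sample lines are made up.

```python
import pandas as pd

TAB_WIDTH = 4  # assumption: one tab counts as four spaces

# Made-up diff lines; the first character is the diff marker.
lines = pd.Series([
    "+\tpublic void run() {",
    "+\t\treturn;",
    "-    int x = 1;",
    " }",
])

# Drop the diff marker (first character), then expand tabs to spaces.
content = lines.str[1:].str.replace("\t", " " * TAB_WIDTH, regex=False)
# Indentation = total length minus length without leading spaces.
indent = content.str.len() - content.str.lstrip(" ").str.len()
print(indent.tolist())
```

Without the replacement, a tab-indented line and a space-indented line at the same nesting level would get different indentation values.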
In [27]:
diff = \
diff_raw[
(~diff_raw['added'].isnull()) |
(~diff_raw['deleted'].isnull())].copy()
diff.head()
Out[27]:
In [28]:
diff['is_comment'] = diff.line.str[1:].str.match(r' *(//|/*\*).*')
diff['is_empty'] = diff.line.str[1:].str.replace(" ","").str.len() == 0
diff['is_source'] = ~(diff['is_empty'] | diff['is_comment'])
diff.head()
Out[28]:
In [29]:
diff.raw.str[0].value_counts()
Out[29]:
In [30]:
diff['lines_added'] = (~diff.added.isnull()).astype('int')
diff['lines_deleted'] = (~diff.deleted.isnull()).astype('int')
diff.head()
Out[30]:
In [31]:
diff = diff.fillna(0)
#diff.to_excel("temp.xlsx")
diff.head()
Out[31]:
In [32]:
commits_per_day = diff.set_index('timestamp').resample("D").sum()
commits_per_day.head()
Out[32]:
In [33]:
%matplotlib inline
commits_per_day.cumsum().plot()
Out[33]:
In [34]:
(commits_per_day.added - commits_per_day.deleted).cumsum().plot()
Out[34]:
In [35]:
(commits_per_day.lines_added - commits_per_day.lines_deleted).cumsum().plot()
Out[35]:
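The daily aggregation and the comparison of lines of code with indentation can be sketched on a few invented numbers. All values below are made up; the column names follow the ones used above, where added and deleted hold indentation in spaces. The ratio indent_per_line is just one possible way to relate the two curves.

```python
import pandas as pd

# Invented change data, already summarized per commit for brevity.
changes = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-08-19 10:00", "2016-08-19 12:00", "2016-08-21 09:00"]),
    "lines_added": [10, 5, 3],
    "lines_deleted": [2, 0, 6],
    "added": [40, 20, 24],    # indentation (spaces) of added lines
    "deleted": [8, 0, 12],    # indentation (spaces) of deleted lines
})

# Sum everything per calendar day; days without changes become 0.
per_day = changes.set_index("timestamp").resample("D").sum()

# Running totals of lines of code and of indentation.
net_lines = (per_day.lines_added - per_day.lines_deleted).cumsum()
net_indent = (per_day.added - per_day.deleted).cumsum()

# Average indentation per remaining line over time.
indent_per_line = net_indent / net_lines
print(indent_per_line)
```

In this toy data, the number of lines shrinks on the last day while the indentation grows, so the indentation per line goes up. That is the pattern we want to look for.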
In [36]:
diff_sum = diff.sum()
diff_sum.lines_added - diff_sum.lines_deleted
Out[36]:
In [37]:
3913
Out[37]: